Word-length entropies and correlations of natural language written texts
Authors
Abstract
We study the frequency distributions and correlations of the word lengths of ten European languages. Our findings indicate that (a) the word-length distribution of short words, quantified by its mean value and entropy, distinguishes the Uralic (Finnish) corpus from the others; (b) the tails at long words, manifested in the high-order moments of the distributions, differentiate the Germanic languages (except for English) from the Romance languages and Greek; and (c) the correlations between nearby word lengths, measured by comparing the entropies of the real texts with those of shuffled texts, are smaller for the Germanic languages and Finnish.
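As a concrete illustration of the quantities named in the abstract, the sketch below (a simplification, not the authors' pipeline; the regex tokenizer, the toy sentence, and the use of bigram entropy as the correlation probe are all assumptions) computes the word-length distribution, its entropy and moments, and compares the bigram entropy of the real length sequence with that of a shuffled copy:

```python
import math
import random
import re
from collections import Counter

def word_lengths(text):
    """Tokenize on letter runs and return the sequence of word lengths."""
    return [len(w) for w in re.findall(r"[^\W\d_]+", text)]

def shannon_entropy(symbols):
    """Shannon entropy (bits) of the empirical distribution of `symbols`."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def moment(lengths, k):
    """k-th raw moment of the word-length distribution."""
    return sum(x ** k for x in lengths) / len(lengths)

def bigram_entropy(symbols):
    """Entropy of adjacent length pairs; sensitive to nearest-neighbour order."""
    return shannon_entropy(list(zip(symbols, symbols[1:])))

text = "We study the frequency distributions and correlations of word lengths."
lengths = word_lengths(text)

print("mean length:", moment(lengths, 1))
print("4th moment :", moment(lengths, 4))   # tails at long words dominate here
print("H(length)  :", round(shannon_entropy(lengths), 3))

# Shuffling preserves the single-word length distribution, so a drop in the
# bigram entropy of the real text relative to the shuffled one signals
# correlations between nearby word lengths.
shuffled = lengths[:]
random.shuffle(shuffled)
print("H2 real    :", round(bigram_entropy(lengths), 3))
print("H2 shuffled:", round(bigram_entropy(shuffled), 3))
```

On a toy sentence the two bigram entropies are nearly equal; the effect the paper reports only emerges on corpus-scale texts.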
Similar Resources
Entropy analysis of word-length series of natural language texts: Effects of text language and genre
We estimate the n-gram entropies of natural language texts in word-length representation and find that these are sensitive to text language and genre. We attribute this sensitivity to changes in the probability distribution of the lengths of single words and emphasize the crucial role of the uniformity of probabilities of having words with length between five and ten. Furthermore, comparison wi...
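For readers unfamiliar with n-gram (block) entropies, a minimal sketch of the estimator in word-length representation follows (the toy length series and the range of n are illustrative assumptions, not the paper's data):

```python
import math
from collections import Counter

def block_entropy(seq, n):
    """Shannon entropy (bits) of the overlapping n-grams of a sequence."""
    blocks = [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]
    total = len(blocks)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(blocks).values())

lengths = [2, 5, 3, 9, 4, 2, 7, 3, 5, 10, 4, 6]  # toy word-length series
for n in range(1, 4):
    print(f"H_{n} = {block_entropy(lengths, n):.3f} bits")
```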
Can Zipf Analyses and Entropy Distinguish between Artificial and Natural Language Texts?
We study statistical properties of natural texts written in English and of two types of artificial texts. As statistical tools we use the conventional and the inverse Zipf analyses, the Shannon entropy and a quantity which is a nonlinear function of the word frequencies, the frequency relative "entropy". Our results obtained by investigating eight complete books and sixteen related artificial te...
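The conventional Zipf analysis named here amounts to a rank-frequency count; a minimal sketch under that reading (the toy sentence is an assumption, and the paper's "frequency relative entropy" statistic is not reproduced):

```python
from collections import Counter

def zipf_rank_frequency(words):
    """Conventional Zipf analysis: word frequencies sorted into rank order.
    (The inverse analysis instead counts how many words share each frequency.)"""
    freqs = sorted(Counter(words).values(), reverse=True)
    return list(enumerate(freqs, start=1))

words = "the cat sat on the mat and the dog sat on the log".split()
for rank, freq in zipf_rank_frequency(words)[:5]:
    print(f"rank {rank}: frequency {freq}")
```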
Word-Length Correlations and Memory in Large Texts: A Visibility Network Analysis
We study the correlation properties of word lengths in large texts from 30 ebooks in the English language from the Gutenberg Project (www.gutenberg.org) using the natural visibility graph method (NVG). NVG converts a time series into a graph and then analyzes its graph properties. First, the original sequence of words is transformed into a sequence of values containing the length of each word, ...
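The visibility criterion at the core of the NVG method can be stated in a few lines; here is a minimal O(n²) sketch (the toy length series is an assumption, and the graph-property analysis that follows in the paper is omitted):

```python
def natural_visibility_edges(series):
    """Natural visibility graph: points i < j are linked when every point in
    between lies strictly below the straight line joining (i, y_i) and (j, y_j)."""
    edges = []
    for i in range(len(series)):
        for j in range(i + 1, len(series)):
            if all(series[k] < series[j] + (series[i] - series[j]) * (j - k) / (j - i)
                   for k in range(i + 1, j)):
                edges.append((i, j))
    return edges

lengths = [3, 1, 4, 1, 5, 9, 2, 6]  # toy word-length series
print(natural_visibility_edges(lengths))
```

Adjacent points are always mutually visible, so the resulting graph is connected by construction.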
The word entropy of natural languages
The average uncertainty associated with words is an information-theoretic concept at the heart of quantitative and computational linguistics. Entropy has been established as a measure of this average uncertainty, also called the average information content. We here use parallel texts of 21 languages to establish the number of tokens at which word entropies converge to stable values. These converg...
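A simple way to see such convergence is to evaluate an entropy estimate on growing prefixes of a token stream; a minimal sketch (the repeated toy sentence stands in for a parallel corpus, and the plug-in estimator is a simplification of the estimators such studies typically use):

```python
import math
from collections import Counter

def word_entropy(tokens):
    """Plug-in (maximum-likelihood) estimate of the word entropy in bits."""
    counts = Counter(tokens)
    n = len(tokens)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

tokens = ("the quick brown fox jumps over the lazy dog " * 200).split()
for n in (100, 400, 1600, len(tokens)):  # growing prefixes
    print(f"{n:5d} tokens: H = {word_entropy(tokens[:n]):.4f} bits")
```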
Entropy, Transinformation and Word Distribution of Information-Carrying Sequences
We investigate correlations in information carriers, e.g. texts and pieces of music, which are represented by strings of letters. For information-carrying strings generated by one source (i.e. a novel or a piece of music) we find correlations on many length scales. The word distribution, the higher order entropies and the transinformation are calculated. The analogy to strings generated through...
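Transinformation here is the mutual information between symbols a fixed distance apart; a minimal sketch over a toy letter string (the string and the choice of distances are assumptions):

```python
import math
from collections import Counter

def transinformation(seq, d):
    """Mutual information (bits) between symbols separated by distance d."""
    pairs = list(zip(seq, seq[d:]))
    n = len(pairs)
    joint = Counter(pairs)
    left = Counter(a for a, _ in pairs)
    right = Counter(b for _, b in pairs)
    return sum((c / n) * math.log2((c / n) / ((left[a] / n) * (right[b] / n)))
               for (a, b), c in joint.items())

letters = list("abracadabra" * 40)  # toy string with a built-in period of 11
for d in (1, 2, 5, 11):
    print(f"d = {d:2d}: T = {transinformation(letters, d):.4f} bits")
```

At d = 11 the toy string repeats exactly, so the transinformation peaks; in natural texts it instead decays slowly over many length scales.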
Journal: Journal of Quantitative Linguistics
Volume: 22, Issue: -
Pages: -
Published: 2015